Community Detection as an Inference Problem 
We express community detection as an inference problem of
determining the most likely arrangement of communities. We then
apply belief propagation and mean-field theory to this problem, and
show that this leads to fast, accurate algorithms for community
detection.



Community detection is a well-studied problem in networks [1].
This is the problem of dividing a network into communities, such
that nodes within the same community tend to be connected by
links, while those in different communities tend not to be
connected by links. This problem has applications in understanding
the structure of social and biological networks [2], while the
closely related graph partitioning problem discussed below has
applications in parallel processing, where work is assigned to
different processors while minimizing interprocessor
communication.

Unfortunately, despite all this interest, there is no formal
definition of the problem. Instead, each author tends to define
communities as whatever is found by a particular community
detection algorithm [3]. In this work, we exploit a standard
method of testing community detection algorithms, the four groups
test [4], to express community detection as an inference, or
maximum likelihood, problem. This leads to a derivation of a Potts
model similar to those derived previously on phenomenological
grounds [5].

To solve this inference problem, we must find the ground state of
the Potts model. To do this, we turn to the techniques of belief
propagation [6], also known as sum-product, and mean-field theory.
Belief propagation was originally developed to perform decoding in
a certain class of error correcting codes, called low-density
parity check codes [7]. In this problem, one has a sender and
receiver communicating over a noisy channel, and the receiver must
determine which of all the possible messages is the most likely.
This problem can be mapped onto finding the ground state of a spin
system on a particular graph [8]. The belief propagation algorithm
exploits the fact that the graph has a low density of loops to
solve this problem, in a manner similar to the famous
Bethe-Peierls [9] solution for the thermodynamics of a spin system
on a Bethe lattice.

We will find that the resulting belief propagation algo- 
rithm for community detection is highly accurate on the 
four groups problem, while also performing well on other 
test networks. We then discuss the scaling of computa- 
tional time with system size, extensions of this algorithm, 
and other problems. 

Inference Problem — To express community detection as an inference
problem we consider the following method, often used to test
community detection algorithms [4]. We invent a network as
follows: we consider $N$ nodes, divided into $q$ different
communities with $n_k$, $k = 1, \ldots, q$, nodes in each
community. We consider each pair of nodes in turn. We connect
those nodes with probability $p_{in}$ if they lie within the same
community and $p_{out}$ if they lie within different communities.
We then run the community detection algorithm and see if it
correctly assigns nodes to communities. Let $q_i$ be the initial
assignment of a community to node $i$. The probability that a
given graph arises from this procedure is equal to

$$p(\{q_i\}) = \Bigl[\prod_{i<j}^{\delta_{q_i,q_j}=1} (1-p_{in})\Bigr] \Bigl[\prod_{i<j}^{\delta_{q_i,q_j}=0} (1-p_{out})\Bigr] \Bigl[\prod_{\langle ij \rangle}^{\delta_{q_i,q_j}=1} \frac{p_{in}}{1-p_{in}}\Bigr] \Bigl[\prod_{\langle ij \rangle}^{\delta_{q_i,q_j}=0} \frac{p_{out}}{1-p_{out}}\Bigr], \qquad (1)$$

where $\prod_{\langle ij \rangle}$ denotes a product over pairs of
vertices $i, j$ connected by an edge. To verify the correctness of
this formula, first consider the case in which there are no edges
at all in the graph. Then, the probability is given correctly by
the first two products of Eq. (1). Adding edges to the graph
changes the probability as given by the second two products in
Eq. (1): each edge replaces a factor $1-p$ by $p$, that is, it
multiplies the probability by $p/(1-p)$.
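The generating procedure and the logarithm of Eq. (1) can be sketched in a few lines of Python. This is only an illustration: the function names, the fixed seed, and the values of `P_IN` and `P_OUT` are assumptions chosen for the example, not values taken from the text.

```python
import math
import random
from itertools import combinations

def planted_partition(sizes, p_in, p_out, rng):
    """Build a random graph by the procedure in the text; returns (labels, edges)."""
    labels = [k for k, n_k in enumerate(sizes) for _ in range(n_k)]
    edges = []
    for i, j in combinations(range(len(labels)), 2):
        p = p_in if labels[i] == labels[j] else p_out
        if rng.random() < p:
            edges.append((i, j))
    return labels, edges

def log_prob_eq1(labels, edges, p_in, p_out):
    """Logarithm of Eq. (1) for a given assignment `labels` and edge list."""
    log_p = 0.0
    # First two products: every pair contributes (1 - p), as if no edges existed.
    for i, j in combinations(range(len(labels)), 2):
        log_p += math.log(1 - (p_in if labels[i] == labels[j] else p_out))
    # Second two products: each edge swaps its (1 - p) factor for p.
    for i, j in edges:
        p = p_in if labels[i] == labels[j] else p_out
        log_p += math.log(p / (1 - p))
    return log_p

rng = random.Random(0)      # fixed seed so the example is repeatable
P_IN, P_OUT = 0.3, 0.05     # illustrative values only
labels, edges = planted_partition([32, 32, 32, 32], P_IN, P_OUT, rng)
shuffled = labels[:]
rng.shuffle(shuffled)
print(log_prob_eq1(labels, edges, P_IN, P_OUT))    # planted assignment
print(log_prob_eq1(shuffled, edges, P_IN, P_OUT))  # random relabeling of the same graph
```

With parameters this well separated, the planted assignment should come out far more probable than a random relabeling of the same graph.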

We now consider a given graph and formulate the problem as
follows: find the most likely community assignment. Following
Bayes' theorem, given a graph, the probability that any given
community assignment is the "correct" assignment is proportional
to the probability $p(\{q_i\})$ multiplied by the a priori
probability of having a given $n_k$. Throughout this paper we
assume the a priori probability of a given $n_k$ is constant, and
thus the optimal community assignment maximizes the probability in
Eq. (1); any non-constant a priori probability can be easily
incorporated by adding additional terms.

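We rewrite Eq. (1) as an exponential:

$$p(\{q_i\}) = c \exp\Bigl[\sum_{\langle ij \rangle} J \delta_{q_i,q_j} + \sum_{i \neq j} J' \delta_{q_i,q_j}/2\Bigr], \qquad (2)$$

where $\sum_{\langle ij \rangle}$ denotes a sum over pairs of
$i, j$ connected by an edge in the graph, and
$c = \exp[\log(1-p_{out}) N(N-1)/2] \exp[\sum_{\langle ij \rangle} \log(p_{out}/(1-p_{out}))]$,
and with

$$J = \log[(p_{in}(1-p_{out}))/((1-p_{in})p_{out})], \qquad J' = \log[(1-p_{in})/(1-p_{out})]. \qquad (3)$$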

The factor of 1/2 in Eq. (2) is to avoid double counting.
Eq. (2) presents the probability as a Potts model problem with
combined short- and long-range interactions, with coupling
constants $J, J'$. The problem of community detection is then
reduced to finding the ground state of this Potts model. Assuming
$p_{in} > p_{out}$, the short-range interactions are
ferromagnetic, favoring the assignment of neighboring nodes to the
same community, while the long-range interactions are
anti-ferromagnetic and prevent one from simply taking all nodes to
lie within the same community. The problem of finding the ground
state is very closely related to the NP-complete problem of graph
partitioning: breaking a graph up into partitions while minimizing
the number of edges connecting partitions and minimizing the
difference in number of nodes between partitions.
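As a consistency check on the Potts form, the following sketch computes $J$ and $J'$ from Eq. (3) and evaluates the logarithm of Eq. (2) by counting intra-community edges and same-community pairs. The function names and the four-node example are assumptions made for illustration.

```python
import math
from itertools import combinations

def couplings(p_in, p_out):
    """Coupling constants J and J' of Eq. (3)."""
    J = math.log(p_in * (1 - p_out) / ((1 - p_in) * p_out))
    J_prime = math.log((1 - p_in) / (1 - p_out))
    return J, J_prime

def log_prob_potts(labels, edges, p_in, p_out):
    """Logarithm of Eq. (2): log c + J * (intra edges) + J' * (same-community pairs)."""
    n = len(labels)
    J, J_prime = couplings(p_in, p_out)
    log_c = (math.log(1 - p_out) * n * (n - 1) / 2
             + len(edges) * math.log(p_out / (1 - p_out)))
    intra_edges = sum(labels[i] == labels[j] for i, j in edges)
    # (1/2) * sum over i != j equals the sum over unordered pairs i < j.
    same_pairs = sum(labels[i] == labels[j] for i, j in combinations(range(n), 2))
    return log_c + J * intra_edges + J_prime * same_pairs

# Tiny hypothetical example: two communities of two nodes, three edges.
labels = [0, 0, 1, 1]
edges = [(0, 1), (2, 3), (1, 2)]
J, J_prime = couplings(0.6, 0.2)
print(J > 0, J_prime < 0)   # ferromagnetic short range, anti-ferromagnetic long range
print(log_prob_potts(labels, edges, 0.6, 0.2))
```

For this small graph the result can be compared by hand with the direct product of edge and non-edge probabilities, $p_{in}^2 \, p_{out} \, (1-p_{out})^3$.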

Belief Propagation — Having arrived at Eq. (2) we have a very
similar problem to that studied in [5]. Instead of using Monte
Carlo methods to find the ground state, we adopt the method of
belief propagation, which we believe to be more efficient for many
of the community detection problems that arise. Indeed, for the
inference problems which arise in many error correcting codes,
belief propagation is the most efficient method.

We begin by taking a mean-field approximation for the long-range
interactions, justified for large $N$. We approximate
$p(\{q_i\}) \approx p_{mft}(\{q_i\})$, where

$$p_{mft}(\{q_i\}) = Z^{-1} \exp\Bigl[\sum_{\langle ij \rangle} J \delta_{q_i,q_j}\Bigr] \exp\Bigl[\sum_i h_i(q_i)\Bigr], \qquad (4)$$

where $Z$ is a normalization, and for $r = 1 \ldots q$ we define
$h_i(r) = J' N p(r) - J' p_i(r)$ and $p(r) = \sum_i p_i(r)/N$, with

$$p_i(r) = \sum_{\{q_j\},\, q_i = r} p_{mft}(\{q_j\}), \qquad (5)$$

so that $p_i(r)$ is the probability in the mean-field
approximation that node $i$ belongs to community $r$. These are a
set of self-consistent equations for $p_i(q)$.

We will find that solving these equations, at least in the belief
propagation approximation below, leads to a spontaneous symmetry
breaking: for sufficiently large $J, J'$, the probability $p_i(q)$
depends on $q$. Of course, given any solution which breaks
symmetry, one can arrive at other valid solutions by relabeling
the communities. We then make a "nodewise" maximum a posteriori
probability (MAP) approximation: for each node $i$, we compute
$p_i(q)$ and then assign the node $i$ to the community $q$ which
maximizes $p_i(q)$. This is an approximation: the set of community
assignments which maximizes $p(\{q_i\})$ may not be given by
maximizing the probability for each node separately. However,
similar approximations work very well for error correcting
codes [7], and these approximations are justified for large
$J, J'$.

To compute the $p_i(q)$, we apply belief propagation. Suppose the
graph forms a tree, with no loops. Then, the problem of solving
for $p_i(q)$ given $h_i(q)$ can be solved: for each pair of nodes
$i, j$ connected by an edge, we define $p_{ij}(r)$ to be the
probability $p_i$ for the network modified by removing the edge
connecting $i$ to $j$. That is,

$$p_{ij}(r) \propto \sum_{\{q_k\},\, q_i = r} \exp(-J \delta_{q_i,q_j})\, p_{mft}(\{q_k\}), \qquad (6)$$

as $\exp(-J \delta_{q_i,q_j})\, p_{mft}(\{q_k\})$ is, up to
normalization, the probability distribution on the network with
the edge removed. Then, for a tree-like structure, the
Bethe-Peierls solution gives

$$p_i(r) = Z_i^{-1} \exp[h_i(r)] \prod_j^{i,j\ \mathrm{n.n.}} [\exp(J) p_{ji}(r) + (1 - p_{ji}(r))], \qquad (7)$$

where $Z_i$ is chosen so that $\sum_{r=1}^{q} p_i(r) = 1$, and
where the product $\prod_j^{i,j\ \mathrm{n.n.}}$ is taken over all
nodes $j$ which are connected to node $i$. Note that
$\exp(J) p_{ji}(r) + (1 - p_{ji}(r)) = \sum_s p_{ji}(s) (\delta_{r,s} \exp(J) + (1 - \delta_{r,s}))$.
Similarly,

$$p_{ij}(r) = Z_{ij}^{-1} \exp[h_i(r)] \prod_{k \neq j}^{i,k\ \mathrm{n.n.}} [\exp(J) p_{ki}(r) + (1 - p_{ki}(r))], \qquad (8)$$

where again we choose $Z_{ij}$ such that
$\sum_{r=1}^{q} p_{ij}(r) = 1$.

These belief propagation equations are exact for a tree. 
If the network is not a tree, however, belief propagation 
can still be very effective, especially if the density of loops 
is small. Of course, not only may the graph have loops, 
but the long-range interactions also induce loops. How- 
ever, for large N , the mean-field approximation we have 
used is justified for the long-range interactions, and as 
discussed below the procedure works well even for test 
networks with loops. 
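The statement that Eqs. (7) and (8) are exact on a tree can be checked numerically on a small example. The sketch below holds the fields $h_i(r)$ fixed (so the long-range self-consistency is left out), iterates Eq. (8) on a five-node tree, and compares the beliefs from Eq. (7) with brute-force marginals. The function names, the tree, and the field values are all assumptions made for illustration.

```python
import math
from itertools import product

def normalize(w):
    s = sum(w)
    return [x / s for x in w]

def bp_marginals(n, edges, J, h, q, sweeps=20):
    """Iterate Eq. (8) for the cavity beliefs, then apply Eq. (7); h[i][r] is held fixed."""
    nbrs = {i: [] for i in range(n)}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    # msg[(i, j)][r] plays the role of p_ij(r): belief of i with edge (i, j) removed.
    msg = {(i, j): [1.0 / q] * q for i in range(n) for j in nbrs[i]}
    eJ = math.exp(J)

    def factor(k, i, r):
        # Contribution of neighbor k to node i: exp(J) p_ki(r) + (1 - p_ki(r)).
        return eJ * msg[(k, i)][r] + (1.0 - msg[(k, i)][r])

    for _ in range(sweeps):
        for (i, j) in list(msg):
            msg[(i, j)] = normalize([
                math.exp(h[i][r]) * math.prod(factor(k, i, r) for k in nbrs[i] if k != j)
                for r in range(q)])
    return [normalize([math.exp(h[i][r]) * math.prod(factor(k, i, r) for k in nbrs[i])
                       for r in range(q)]) for i in range(n)]

def exact_marginals(n, edges, J, h, q):
    """Brute-force marginals of exp[J sum_<ij> delta + sum_i h_i(q_i)], as in Eq. (4)."""
    marg = [[0.0] * q for _ in range(n)]
    for conf in product(range(q), repeat=n):
        w = math.exp(J * sum(conf[i] == conf[j] for i, j in edges)
                     + sum(h[i][conf[i]] for i in range(n)))
        for i in range(n):
            marg[i][conf[i]] += w
    return [normalize(m) for m in marg]

edges = [(0, 1), (1, 2), (1, 3), (3, 4)]   # a small tree
h = [[0.3, 0.0, -0.2], [0.0, 0.1, 0.0], [0.0, 0.0, 0.5], [-0.4, 0.2, 0.0], [0.1, 0.0, 0.0]]
bp = bp_marginals(5, edges, J=1.0, h=h, q=3)
exact = exact_marginals(5, edges, J=1.0, h=h, q=3)
print(max(abs(a - b) for x, y in zip(bp, exact) for a, b in zip(x, y)))
```

On a graph with loops the same update rule can still be iterated, but the agreement with the exact marginals is then only approximate.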

We can simplify the algorithm by replacing the belief propagation
equations by a set of naive mean-field equations: for each site we
track only the beliefs $p_i(q_i)$ and iterate the equations

$$p_i(r) = Z_i^{-1} \exp[h_i(r)] \prod_j^{i,j\ \mathrm{n.n.}} (\exp(J) p_j(r) + (1 - p_j(r))). \qquad (9)$$

This method is most appropriate for networks with large average
coordination number $z$, where it becomes faster than the belief
propagation method by roughly a factor $z$ as there are fewer
equations to solve; also, as we will see, for large $z$ the
mean-field approximation used here performs comparably to belief
propagation.
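A single application of Eq. (9) to one node might look like the following sketch. The function name and the three-node example are assumptions; the computation is done with logarithms only to avoid overflow for nodes with many neighbors, and $N p(r)$ is recomputed on every call for clarity rather than speed.

```python
import math

def mean_field_update(i, beliefs, nbrs, J, J_prime):
    """Right-hand side of Eq. (9) for node i, with h_i(r) = J' N p(r) - J' p_i(r)."""
    q = len(beliefs[i])
    log_w = []
    for r in range(q):
        n_p_r = sum(b[r] for b in beliefs)        # N p(r)
        h = J_prime * (n_p_r - beliefs[i][r])     # h_i(r)
        log_w.append(h + sum(
            math.log(math.exp(J) * beliefs[j][r] + 1.0 - beliefs[j][r])
            for j in nbrs[i]))
    top = max(log_w)                              # subtract the max for stability
    w = [math.exp(x - top) for x in log_w]
    z = sum(w)                                    # plays the role of Z_i
    return [x / z for x in w]

# Hypothetical three-node path 0 - 1 - 2 with two communities.
nbrs = {0: [1], 1: [0, 2], 2: [1]}
beliefs = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
print(mean_field_update(1, beliefs, nbrs, J=1.0, J_prime=-0.1))
```

If every belief is exactly uniform, the update returns the uniform distribution again, which is the symmetric fixed point that the small random initial perturbations are meant to move away from.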

There are many ways to solve the belief propagation equations. We
choose to initialize each of the $p_i(r) = 1/q + \epsilon_i(r)$,
where $\epsilon_i(r)$ is chosen randomly and is small, and
similarly for each of the $p_{ij}(r)$. We then perform a fixed
number of iterations: on each iteration we first randomly select
one directed edge $i, j$ and then update the corresponding
function $p_{ij}(r)$ by replacing its current value with 0.75
times its current value plus 0.25 times the value given by solving
Eq. (8). We then randomly select one node $i$ and replace the
function $p_i(r)$ by 0.75 times its current value plus 0.25 times
the value given by solving Eq. (7). After updating $p_i(r)$ we
update $p(r)$ and $h_i(r)$, thus solving the belief propagation
and mean-field equations simultaneously. Other relaxation methods
may prove better in certain applications.

FIG. 1: Accuracy of belief propagation and mean-field algorithms.
Circles, squares, and triangles represent different numbers of
iterations, using constants from Eq. (3). Pluses represent a
different, fixed choice of constants described in text.
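The relaxation schedule described in the text (random initialization near $1/q$, one randomly chosen directed edge and one randomly chosen node per iteration, and a 0.75/0.25 mixing of old and new values) can be sketched as follows. The function name `relax`, the perturbation size `eps`, the six-node test graph, and the constants passed to it are assumptions for illustration, not the settings used for the figures.

```python
import math
import random

def normalize(w):
    s = sum(w)
    return [x / s for x in w]

def relax(n, edges, q, J, J_prime, iterations, rng, mix=0.25, eps=1e-3):
    """Damped, randomly ordered updates of Eqs. (7) and (8) with self-consistent fields."""
    nbrs = {i: [] for i in range(n)}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)

    def noisy_uniform():
        # 1/q plus a small random perturbation, renormalized.
        return normalize([1.0 / q + eps * rng.random() for _ in range(q)])

    belief = [noisy_uniform() for _ in range(n)]                        # p_i(r)
    msg = {(i, j): noisy_uniform() for i in range(n) for j in nbrs[i]}  # p_ij(r)
    directed = list(msg)
    total = [sum(b[r] for b in belief) for r in range(q)]               # N p(r)
    eJ = math.exp(J)

    def target(i, skip):
        # Right-hand side of Eq. (8) when skip = j, or of Eq. (7) when skip is None.
        log_w = []
        for r in range(q):
            h = J_prime * (total[r] - belief[i][r])   # h_i(r) = J' N p(r) - J' p_i(r)
            log_w.append(h + sum(
                math.log(eJ * msg[(k, i)][r] + 1.0 - msg[(k, i)][r])
                for k in nbrs[i] if k != skip))
        top = max(log_w)
        return normalize([math.exp(x - top) for x in log_w])

    for _ in range(iterations):
        i, j = rng.choice(directed)                   # one random directed edge
        new = target(i, skip=j)
        msg[(i, j)] = [(1 - mix) * a + mix * b for a, b in zip(msg[(i, j)], new)]
        i = rng.randrange(n)                          # one random node
        new = target(i, skip=None)
        updated = [(1 - mix) * a + mix * b for a, b in zip(belief[i], new)]
        for r in range(q):
            total[r] += updated[r] - belief[i][r]     # keep N p(r), and hence h, current
        belief[i] = updated
    return belief

# Two triangles joined by a single edge; constants chosen only for illustration.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
beliefs = relax(6, edges, q=2, J=1.5, J_prime=-0.3, iterations=2000, rng=random.Random(1))
assignment = [max(range(2), key=b.__getitem__) for b in beliefs]   # nodewise MAP
print(assignment)
```

Because each update is a convex combination of two normalized distributions, the beliefs stay normalized throughout; whether they break symmetry depends on the constants, as the text discusses.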

Applications — We have tested this algorithm on sev- 
eral problems. We first consider the four groups prob- 
lem. In this case, we take a randomly generated network 
of $N = 128$ nodes, divided into four communities of 32 nodes
each. After generating the network, we run the community detection
algorithm to find the most likely assignment of communities. We
measure the algorithm's performance by determining the fraction of
nodes whose community it identifies correctly. There is some
arbitrariness in how this fraction of correctly identified nodes
is defined, which is connected with the arbitrariness in the
labeling of communities: one can permute the community labels as
desired, given the Potts symmetry of Eq. (2). To resolve this
arbitrariness, we follow the stringent definition adopted in [10]
for the accuracy.
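To illustrate the labeling ambiguity, the sketch below scores an assignment by the best match over all permutations of the community labels. This is a simple stand-in chosen for the example; it is not claimed to be the stringent definition of [10], and the function name is an assumption.

```python
from itertools import permutations

def best_permutation_accuracy(true_labels, found_labels, q):
    """Largest fraction of matching nodes over all relabelings of the found communities."""
    best = 0
    for perm in permutations(range(q)):
        hits = sum(t == perm[f] for t, f in zip(true_labels, found_labels))
        best = max(best, hits)
    return best / len(true_labels)

print(best_permutation_accuracy([0, 0, 1, 1], [1, 1, 0, 0], 2))  # 1.0: same partition, labels swapped
print(best_permutation_accuracy([0, 0, 1, 1], [0, 1, 1, 1], 2))  # 0.75: one node misplaced
```

For four communities this loops over only 24 permutations, so the brute-force search is cheap.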

We choose $p_{in}$ and $p_{out}$ so that each node has on average
$z_{in}$ connections to other nodes in the same community and
$z_{out}$ connections to nodes in different communities. We pick
$z_{in} + z_{out} = 16$, and consider a range of values of
$z_{in}$, the average number of intra-community links. For large
$z_{in}$, the community detection problem is easier, as the effect
of the community structure is much clearer.
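The conversion from average degrees to edge probabilities follows directly from the definitions: a node has 31 possible partners inside its own community and 96 outside it. A minimal sketch, with a hypothetical function name:

```python
def edge_probabilities(z_in, z_out, n_nodes=128, n_groups=4):
    """Edge probabilities that give average degrees z_in and z_out per node."""
    size = n_nodes // n_groups          # 32 nodes per community
    p_in = z_in / (size - 1)            # 31 possible partners inside the community
    p_out = z_out / (n_nodes - size)    # 96 possible partners outside it
    return p_in, p_out

p_in, p_out = edge_probabilities(8, 8)
print(p_in, p_out)
```

Even at $z_{in} = z_{out} = 8$ the intra-community probability exceeds the inter-community one, because the same average degree is spread over far fewer possible partners.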

The results of this procedure are shown in Fig. 1. As seen, with
100000 iterations, the algorithm is highly accurate. With 50000
iterations, a slight decrease in accuracy is noticed for small
$z_{in}$, and the accuracy goes down significantly at 10000
iterations (by increasing to 200000 iterations, a slight
improvement is noted for $z_{in} < 8$). For the first three
curves, we used the choice of constants in Eq. (3). Since these
constants depend on the given $z_{in}, z_{out}$, something which
may not be known for an arbitrary network, we repeated the
algorithm with a particular fixed choice of constants,
$\exp[J] = 6$, $\exp[N J'] = 10^{?}$. This choice of constants was
made completely arbitrarily, but as seen the algorithm still works
well, with a much higher accuracy than the Newman-Girvan
algorithm. However, the Newman-Girvan algorithm has some
advantages in terms of picking the optimal number of different
communities into which to divide the network, while the present
algorithm takes the number of communities as an input. We ignore,
however, the a priori knowledge that each community has 32 nodes.

FIG. 2: Accuracy of belief propagation and mean-field algorithms
for $z_{in} + z_{out} = 4$.

Finally, we have tested the mean-field algorithm. As seen, this is
almost as accurate as the belief propagation algorithm for
$z_{in} > 8$, and actually performs better for $z_{in} < 7.5$. We
believe that the improved performance is a result of the fact that
the mean-field equations more easily break symmetry. Tests on
networks with a lower coordination number showed the difference
more clearly, as given in Fig. 2 for a network with four groups
and $z_{in} + z_{out} = 4$.

We have also tested the belief propagation algorithm on simple
networks, such as dividing $N$ nodes arranged on a straight line
with connections between nearest neighbors on the line into two
different communities, as well as the Zachary karate club
network [11]. The relaxation of the algorithm needed to be done
slightly more slowly than described above (ten randomly chosen
nodes were relaxed before each edge was relaxed), and some care
was taken in choosing the constants, due to the lower coordination
number, in order to obtain convergence: with poor choices the
beliefs $p_i(r)$ oscillated randomly. After finding these
constants on the straight line network, the algorithm was tested
on the Zachary network, and identified the communities accurately,
with one error of placing node 10 in the wrong community, as
labeled in the figure in [?]. However, this node has one link to
each of the two communities and so the network gives it no obvious
preference for either community. Interestingly, after convergence
of the equations, for almost all nodes $i$, the maximum over $r$
of $p_i(r)$ was greater than 0.99, except node 9 where it was only
0.98 and node 10 where it was only 0.87; these two nodes have
roughly equal connections to both communities.

Discussion — We have expressed community detection 
as an inference problem, providing a formulation of the 
problem in statistical mechanical terms. We have then 
applied the belief propagation method to solve the resulting
statistical problem. The results are accurate; however, there are
a number of questions that should be addressed, as well as
possible extensions. First, there are some questions about picking
the constants $J, J'$. In some cases, especially on test networks
with a low coordination number, a poor choice of constants leads
to either a lack of convergence of the belief propagation
equations, or else convergence to a solution in which
$p_i(r) \to 1/q$ so that the spontaneous symmetry breaking is
absent. In both cases the algorithm performs poorly at finding the
communities. The former case requires a slower relaxation of the
equations, while the latter case requires an increase in the
constants $J, J'$. Fortunately, both of these cases can be
detected by looking at the $p_i(r)$ as the algorithm runs, and
then corrected, so that the algorithm warns of its possible
failure in these cases. We did not find any case in which the
belief propagation equations converged to a poor solution which
spontaneously broke symmetry.

The next question is the scaling of the algorithm with system
size. Accurate results were found with 50000 iterations. Since
each iteration updates one directed edge, there are roughly
$50000/(128 \times 16) \approx 24$ iterations per edge. Even with
10000 iterations, or roughly 5 per edge, some information is
found. For a general network, we expect that if there are few
links between communities, then the number of iterations required
per edge will be proportional to the phase-ordering time for a
given community under the appropriate dynamics. This phase
ordering time typically scales [13] as some power of the relevant
length scale for a community, and for many networks this length
scale is of order $\log(N)$. We thus expect for a network with
average coordination number $z$ that the time will typically be of
order $z N \log(N)^{\alpha}$ for some $\alpha$.
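The per-edge iteration counts quoted in this paragraph follow from the number of directed edges, about $N z = 128 \times 16 = 2048$. A short check, with a hypothetical helper name:

```python
def iterations_per_directed_edge(iterations, n_nodes, z):
    # A graph with n_nodes nodes of average degree z has about n_nodes * z directed edges.
    return iterations / (n_nodes * z)

print(round(iterations_per_directed_edge(50000, 128, 16), 1))  # 24.4
print(round(iterations_per_directed_edge(10000, 128, 16), 1))  # 4.9
```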

There are also a number of possible ways of modifying the
algorithm. It may be desirable to incorporate additional a priori
knowledge about the communities, or to study the modularity of the
different divisions, as in [?]. In some cases, the belief
propagation equations may be reduced to linear programming [?]. A
final interesting question relates to our need for spontaneous
symmetry breaking. It would be desirable to be able to go beyond
belief propagation using methods such as [?]. However, doing this
may cause us to lose the symmetry breaking, and thus the MAP
approximation may need to be replaced. Survey propagation [?] may
help in this regard, as it allows us to consider a distribution of
magnetic fields for each site, each possibility corresponding to a
different symmetry breaking. Even without these extensions,
however, the present algorithm leads to useful results.